Lab 3

3DS : Cherry Blossom

Lyuda Bekwinknoll, Meghana Cyanam, Theresa Marie Duenas, Kevin Kiser

Complete the following lab as a group. This document should exist in your GitHub repo while you’re working on it. Your code should be heavily commented so someone reading your code can follow along easily. See the first code snippet below for an example of commented code.

Here’s the catch: For any given problem, the person writing the code should not be the person commenting that code, and every person must both code AND comment at least one problem in this lab (you decide how to split the work). This will involve lots of pushing and pulling through Git, and you may have to resolve conflicts if you’re not careful!

ALSO, all plots generated should have labeled axes, titles, and legends when appropriate. Don’t forget units of measurement! Make sure these plots could be interpreted by your client.

# load packages
library(dplyr)
library(ggplot2)
library(lubridate)
library(chron)
library(plotly)
library(purrr)

Cherry Blossom Race Plotting Problems

Question 1

  1. Looking at race times all on their own.
    1. Import the data from CBdata.1_10.RData and combine all 10 year’s worth of data into one data frame, removing observations with missing age or time data (this should be very similar to previous labs). For each year, create a histogram of race times and plot a the corresponding density curve in the same figure. Arrange all ten of these plots into one figure

I am modifying cleaned_data from lab2

# load cherry blossom data
load("CBdata.1_10.RData")

# create one df and remove 'Pis/Tis' column and NAs from 'Age' and 'Time'
cleaned_data <- bind_rows(CBdata.1_10) %>%
  filter(!is.na(Age)) %>%
  filter(!is.na(Time) & Time != "") %>%
  select(-'PiS/TiS')

# print out random observations to check
print(cleaned_data[c(137, 180, 5404), ])
##          Race             Name Age    Time Pace Division PiD/TiD      Hometown
## 137  1974 10M  Joris Hogan (M)  28 1:03:97 6:28    M2529      NR            NR
## 180  1974 10M John Gibbons (M)  23 1:07:91 6:51    M2024      NR            NR
## 5404 1979 10M  Erik Meyers (M)  29 1:02:49 6:17    M2529  99/438 Arlington, VA
##      Year
## 137  1974
## 180  1974
## 5404 1979
# Remove any none-numeric and non-column characters from 'Time'
cleaned_data$Time <- gsub("[^0-9:]", "", cleaned_data$Time)

# Remove NA's from 'Time' 
cleaned_data <- cleaned_data[!is.na(cleaned_data$Time), , drop = FALSE]

# Remove empty strings from 'Time'
cleaned_data <- cleaned_data[cleaned_data$Time != '', , drop = FALSE]

# Converts 'Time' column to a time format
cleaned_data$Time <- chron::times(cleaned_data$Time)
## Warning in convert.times(times., fmt): time-of-day entries out of range in
## positions 137,180,5404 set to NA
# Removes any potential NA's generated from the conversion above
cleaned_data <- cleaned_data[!is.na(cleaned_data$Time), , drop = FALSE]
# calculate the mean, min, max per 'Year' for 'Time'
summary_by_year <- cleaned_data %>%
  group_by(Year) %>%
  summarize(
    mean_Time = mean(Time),
    min_Time = min(Time),
    max_Time = max(Time),
    participants = n()
  )

# print out summary generated above to have an idea of our data distribution
print(summary_by_year)
## # A tibble: 10 × 5
##     Year mean_Time min_Time max_Time participants
##    <int> <times>   <times>  <times>         <int>
##  1  1973 00:58:12  00:55:14 01:01:11            2
##  2  1974 01:09:48  00:50:50 01:50:26          330
##  3  1975 01:09:58  00:51:47 01:39:45          239
##  4  1976 01:09:40  00:49:09 01:58:00          481
##  5  1977 01:09:38  00:49:44 02:44:16          720
##  6  1978 01:02:13  00:48:57 01:09:17          700
##  7  1979 01:17:00  00:48:00 02:10:30         2975
##  8  1980 01:13:21  00:47:08 02:05:57         1617
##  9  1981 01:16:32  00:47:17 02:08:47         3338
## 10  1982 01:16:20  00:49:29 02:01:39         3180
# Creates function that creates histogram and density plot pairs for dif years
# Takes in 2 arguments: 'plot_year' and 'cleaned_data'
plot_histogram_density <- function(plot_year, cleaned_data) {
  
  # pulls out data from 'cleaned_data' based on 'plot_year'
  subset_data <- cleaned_data[cleaned_data$Year == plot_year, ]
  
  # makes histogram of data without plotting it
  hist_data <- hist(subset_data$Time, breaks = 10, plot = FALSE)
  
  # plots above generated histogram with color and labels
  plot(hist_data, col = "lightblue",
       main = paste("", plot_year),
       xlab = "Time (hr:min)",
       ylab = "Density",
       freq = FALSE, xaxt = "n")
  # custom x-axis with labels as hours and minutes
  axis(1, at = hist_data$breaks,
       labels = chron(times = hist_data$breaks, format = "h:mm"),
       las = 2)
  # adds density plot line in blue over histogram
  lines(density(subset_data$Time), col = "blue", lwd = 2)
}

# creates 2x5 martix layout for plots
par(mfrow = c(2, 5))

# loops over years 1973:1982 and calls function above to 
# generate histogram and density plot
for (year in 1973:1982) {
  plot_histogram_density(year, cleaned_data)
}

There were only two participants in 1973 with age and time recorded.

  1. Plot the density curves for all ten years in one figure, along with a density curve of the combined data set (for a total of 11 curves). The main focus should be displaying the combined density, but all 11 densities should be discernible.
density_data <- density(cleaned_data$Time)


years <- unique(cleaned_data$Year)


line_colors <- c("red", "#ffd92f", "orange", "firebrick", "magenta", 
                 "purple", "blue", "cornflowerblue", "darkgreen", "lightgreen")


custom_ticks <- c("00:42", "01:13", "01:45", "02:17", "02:49")
tick_positions <- seq(min(density_data$x), max(density_data$x), 
                      length.out = length(custom_ticks))
tick_labels <- as.numeric(tick_positions)


p <- plot_ly() %>%
  add_trace(type = "scatter",
            mode = "lines",
            x = density_data$x,
            y = density_data$y,
            line = list(color = "black", width = 5),
            name = "All Years") %>%
  layout(title = "Density Curve by Year",
         xaxis = list(title = "Time (hr:min)", showline = TRUE, 
                      tickvals = tick_positions, ticktext = custom_ticks),
         yaxis = list(title = "Density", showline = TRUE),
         showlegend = TRUE)


for (i in seq_along(years)) {
  year <- years[i]
  subset_data <- cleaned_data[cleaned_data$Year == year, ]
  p <- add_trace(p,
                 type = "scatter",
                 mode = "lines",
                 x = density(subset_data$Time)$x,
                 y = density(subset_data$Time)$y,
                 line = list(color = line_colors[i], width = 3, 
                             dash = ifelse(year %% 2 == 0, "solid", "dash")),
                 name = as.character(year))
}

p

The above plot is interactive. Click the lines in the legend to remove or add a line.

Question 2

  1. Correlating age and time: Create a scatter plot of age and race time, with time being the response. All ten year’s worth of data should be included, but you should be able to tell which year each point comes from. Include trend lines for each year, along with a trend line for the combined data set.
scat_plot = function(curr_year,cleaned_data)
{

  sub_data = cleaned_data %>% filter(Year == curr_year)
  plot(sub_data$Age ~ sub_data$Time,
       main = curr_year,
       xlab = "Time",
       ylab = "Age")
  
  abline(lm(sub_data$Age ~ sub_data$Time), col="red")

}

par(mfrow = c(2, 5))

for (curr_year in 1973:1982)
{

  scat_plot(curr_year, cleaned_data)
}

Write a short interpretation of the plot as if you were explaining it to your client.

Looking at the red line showing the general linear trend of the runners we can see that as the ages of runners increased so did the time.

Question 3

  1. Relating age and times categorically:
    We’re interested in the age composition for ten performance groups. The performance groups are defined based ten percentiles (10%, 20%,…100%) of relative finish position. For example, someone finishing 3rd out of 125 would be in the 10th-percentile group, while someone finishing 985 out of 1013 would be in the 100th-percentile group.
    The age groups we’re interested in are defined by decade, so separate people in their 20’s from people in their 30’s and so forth.
    Generate one plot that displays the age composition of each of the ten performance groups. Make sure you’re using all ten year’s worth of data.
    Hint: You can compute performance groups manually from Year and Time, or by carefully manipulating Pis/Tis.